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Abstract 

While statisticians are well-accustomed to performing exploratory analysis in the modeling stage 
of an analysis, the notion of conducting preliminary general-purpose exploratory analysis in the 
Monte Carlo stage (or more generally, the model-fitting stage) of an analysis is an area which we feel 
deserves much further attention. Towards this aim, this paper proposes a general-purpose algorithm 
for automatic density exploration. The proposed exploration algorithm combines and expands upon 
components from various adaptive Markov chain Monte Carlo methods, with the Wang-Landau algo- 
rithm at its heart. Additionally, the algorithm is run on interacting parallel chains - a feature which 
both decreases computational cost as well as stabilizes the algorithm, improving its ability to explore 
the density. Performance of this new parallel adaptive Wang-Landau (PAWL) algorithm is studied 
in several applications. Through a Bayesian variable selection example, the authors demonstrate the 
convergence gains obtained with interacting chains. The ability of the algorithm's adaptive proposal 
to induce mode-jumping is illustrated through a Bayesian mixture modeling application. Lastly, 
through a 2D Ising model, the authors demonstrate the ability of the algorithm to overcome the high 
correlations encountered in spatial models. The appendices contain the full algorithmic description 
in pseudo-code, a tri-modal toy example and remarks on the convergence of the proposed algorithm. 

1 Introduction 

As improvements in technology introduce measuring devices capable of capturing ever more complex 
real-world phenomena, the accompanying models used to understand such phenomena grow accordingly. 
While linear models under the assumption of Gaussian noise were the hallmark of early 20th century 
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statistics, the past several decades have seen an explosion in statistical models which produce com- 
plex and high-dimensional density functions for which simple, analytical integration is impossible. This 
growth was largely fueled by renewed interest in Bayesian statistics accompanying the Markov chain 
Monte Carlo (MCMC) revolution in the 1990's. With the computational power to explore the posterior 
distributions arising from Bayesian models, MCMC allowed practitioners to build models of increasing 
size and nonlinearity. 

As a core component of many of the MCMC algorithms discussed later, we briefly recall the Metropolis- 
Hastings algorithm. With the goal of sampling from a density ir, the algorithm generates a Markov chain 
( x t)t=i with invariant distribution ir. From a current state Xt, a new state x' is sampled using a proposal 
density q^Xt^x') parametrized by r\. The proposed state x' is accepted as the new state Xt+i of the 
chain with probability 



and if it is rejected, the new state Xt+i is set to the previous state Xt- From this simple algorithmic 
description, it is straightforward to see that if x t is in a local mode and the proposal density q v (x t ,x / ) 
has not been carefully chosen to propose samples from distant regions, the chain will become stuck in 
the current mode. This is due to the rejection of samples proposed outside the mode, underscoring the 
importance of ensuring q r) (xt,x / ) is intelligently designed. 

Though standard MCMC algorithms such as the Metropolis-Hastings algorithm and the Gibbs sam- 
pler have been studied thoroughly and the convergence to the target distribution is ensured under weak 
assumptions, many applications introduce distributions which cannot be sampled easily by these algo- 
rithms. Multiple reasons can lead to failure in practice, even if long-run convergence is guaranteed; the 
question then becomes whether or not the required number of iterations to accurately approximate the 
density is reasonable given the currently available computational power. Among these reasons, let us 
cite a few that will be illustrated in later examples: the probability density function might be highly 
multimodal, in which case the chain can get stuck in local modes. Alternatively or additionally, it might 
be defined on a high-dimensional state space with strong correlations between the components, in which 
case the proposal distributions (and in particular their large covariance matrices) are very difficult to tune 
manually. These issues lead to error and bias in the resulting inference, and may be detected through 
convergence monitoring techniques (see, e.g., Robert and Casella (2004)). However, even when conver- 
gence is monitored, it is possible that entire modes of the posterior are missed. To address these issues, 
we turn to a burgeoning class of Monte Carlo methods which we refer to as "exploratory algorithms." 

In the following section, we discuss the traits that allow exploratory MCMC algorithms to perform 
inference in multimodal, high-dimensional distributions, connecting these traits to existing exploratory 
algorithms in the process. In Section 3, we detail one of these, the Wang-Landau algorithm, and propose 
several novel improvements that make it more adaptive, hence easier to use, and also improve convergence. 
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Section 4 applies the proposed algorithm to variable selection, mixture modeling and spatial imaging, 
before Section 5 concludes. 

2 Exploratory Algorithms 

As emphasized by Schmidler (2011), there are two distinct goals of existing adaptive algorithms. Firstly, 
algorithms which adapt the proposal according to past samples are largely exploitative, in that they 
improve sampling of features already seen. However, modes or features not yet seen by the sampler might 
be quite different from the previously explored region, and as such adaptation might prevent adequate 
exploration of alternate regions of the state space. As an attempted solution to this problem Craiu 
et al. (2009) suggest adapting regionally, with parallel chains used to perform the adaptation. Secondly, 
there exists a set of adaptive algorithms whose goal is to adapt in such a way as to encourage density 
exploration. These include, for instance, the equi-energy sampler (Kou et al., 2006), parallel tempering 
(Swendsen and Wang, 1986; Geyer, 1991), and the Wang-Landau (Wang and Landau, 2001a,b; Liang, 
2005; Atchadc and Liu, 2010) algorithms among others. The algorithm developed here fits into the latter 
suite of tools, whose goal is to explore the target density, particularly distant and potentially unknown 
modes. 

Although the aforementioned algorithms have proven efficient for specific challenging inference prob- 
lems, they are not necessarily designed to be generic, and it is often difficult and time-consuming for 
practitioners to learn and code these algorithms merely to test a model. As such, while statisticians are 
accustomed to exploratory data analysis, we believe that there is room for generic exploratory Monte 
Carlo algorithms to learn the basic features of the distribution or model of interest, particularly the 
locations of modes and regions of high correlation. These generic algorithms would ideally be able to 
deal with discrete and continuous state spaces and any associated distribution of interest, and would 
require as few parameters to tune as possible, such that users can use them before embarking on time- 
consuming, tailor-made solutions designed to estimate expectations with high precision. In this way one 
may perform inference and compare between a wide range of models without building custom-purpose 
Monte Carlo methods for each. 

We first describe various ideas that have been used to explore highly-multimodal densities, and then 
describe recent works aimed at automatically tuning algorithmic parameters of MCMC methods, making 
them able to handle various situations without requiring much case-specific work from the user. 

2.1 Ability to Cross Low-Density Barriers 

The fundamental problem of density exploration is settling into local modes, with an inability to cross 
low-density regions to find alternative modes. For densities 7r which are highly multi-modal, or "rugged," 
one can employ tempering strategies, sampling instead from a distribution proportional to ■k v I t with 
temperature r > 1. Through tempering, the peaks and valleys of tt are smoothed, allowing easier 
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exploration. This is the fundamental idea behind parallel tempering, which employs multiple chains at 
different temperatures; samples are then swapped between chains, using highly tempered chains to assist 
in the exploration of the untempered chain (Geyer, 1991). Marinari and Parisi (1992) subsequently 
proposed simulated tempering, which dynamically moves a single chain up or down the temperature 
ladder. One may also fit tempering within a sequential Monte Carlo approach, whereby samples are 
first obtained from a highly tempered distribution; these samples are transitioned through a sequence of 
distributions converging to 7r using importance sampling and moves with a Markov kernel (Neal, 2001; 
Del Moral et al., 2006). However, using tempering strategies with complex densities, one must be careful 
of phase transitions, where the density transforms considerably across a given temperature value. 

A related class of algorithms works by partitioning the state space along the energy function — log ir(x). 
The idea of slicing, or partitioning, along the energy function is the hallmark of several auxiliary variable 
sampling methods, which iteratively sample U ~ U[0,tt(X)} then X ~ U{X : n(X) > U}. This is 
the fundamental idea behind the Swendsen-Wang algorithm (Swcndsen and Wang, 1987; Edwards and 
Sokal, 1988) and related algorithms (e.g. Bcsag and Green (1993); Higdon (1998); Neal (2003)). The 
equi-energy sampler (Kou et al., 2006; Baragatti et al., 2012), in contrast to the above auxiliary variable 
methods, begins by sampling from a highly tempered distribution; once convergence is reached, a new 
reduced-temperature chain is run with updates from a mixture of Metropolis moves and exchanges of 
the current state with the value of a previous chain in the same energy band. The process is continued 
until the temperature reaches 1 and the invariant distribution of the chain is the target of interest. As 
such, this algorithm works through a sequence of tempered distributions, using previous distributions to 
create intelligent mode-jumping moves along an equal-energy set. 

In a similar vein, the Wang-Landau algorithm (Wang and Landau, 2001a, b) also partitions the state 
space X along a reaction coordinate £,{x), typically the energy function: £(x) = — log7r(x), resulting 
in a partition (Xi)f =1 . The algorithm generates a time-inhomogeneous Markov chain that admits an 
invariant distribution 7r t at iteration t, instead of the target distribution 7r itself as e.g. in a standard 
Metropolis-Hastings algorithm. The distribution 7r t is designed such that the generated chain equally 
visits the various regions X\ as t — > oo. Because the Wang-Landau algorithm lies at the heart of our 
proposed algorithm, it is extensively described in Section 3. 

It is worth discussing a similar, recently proposed algorithm which combines Markov chain Monte 
Carlo and free energy biasing (Chopin et al., 2012) and its sequential Monte Carlo counterpart (Chopin 
and Jacob, 2010). The central idea of the latter is to explore a sequence of distributions, successively 
biasing according to a reaction coordinate £ in a similar manner. However, we have found the method 
to be largely dependent on selecting a well-chosen initial distribution ttq, as is usually the case with 
sequential Monte Carlo methods for static inference. If the initial distribution is not chosen to be flatter 
than the target distribution, which is possibly the case since the regions of interest with respect to the 
target distribution are a priori unknown, the efficiency of the SMC methods relies mostly on the move 
steps within the particle filter, which are themselves Metropolis-Hastings or Gibbs moves. 
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2.2 Adaptive Proposal Mechanism 

Concurrent with the increasing popularity of exploratory methods, the issue of adaptively fine-tuning 
MCMC algorithms has also seen considerable growth since the foundational paper of Haario et al. (2001), 
including a series of conferences dedicated solely to the problem (namely, Adap'ski 1 through 3 among 
others); see the reviews of Andrieu and Thorns (2008) and Atchade et al. (2011) for more details. While 
the de-facto standard has historically been hand-tuning of MCMC algorithms, this new work finds interest 
in automated tuning, resulting in a new class of methods called adaptive MCMC. 

The majority of the existing literature focuses on creating intelligent proposal distributions for an 
MCMC sampler. The principal idea is to exploit past samples to induce better moves across the state 
space by matching moments of the proposal and past samples, or by encouraging a particular acceptance 
rate of the sampler. The raison d'etre of these algorithms is that tuning MCMC algorithms by hand 
is both time-consuming and prone to inaccuracies. By automating the selection of the algorithm's 
parameters, practitioners might save considerable time in their analyses. This feature is pivotal in an 
automated density exploration algorithm. Due to its exploratory nature, it is likely that the practitioner 
might not have complete knowledge of even the scale of the density support; as a result, having a proposal 
distribution which adapts to the density at hand is a crucial step in the automation process. 

One must be careful in selecting the type of adaptation mechanism employed to encourage exploration, 
rather than simply exploiting previously explored modes. For instance, tuning a proposal covariance to 
a previously visited mode might prevent the algorithm from reaching as yet unexplored modes in the 
direction of the current mode's minor axis. Additionally, when combined with a progressively biased 
distribution as in the Wang-Landau algorithm, it is desirable to have a proposal which first samples 
what it sees well, then later grows in step size to better explore the flattened (biased) distribution. 

3 Proposed Algorithm 

We now develop our proposed algorithm. After recalling the Wang-Landau algorithm, which constitutes 
the core of our method, we describe three improvements: an adaptive binning strategy to automate the 
difficult task of partitioning the state space, the use of interacting parallel chains to improve the conver- 
gence speed and use of computational resources, and finally the use of adaptive proposal distributions to 
encourage exploration as well as to reduce the number of algorithmic parameters. We detail at the end 
of the section how to use the output of the algorithm, which we term parallel adaptive Wang-Landau 
(PAWL) to answer the statistical problem at hand. 

3.1 The Wang-Landau Algorithm 

As previously mentioned, the Wang-Landau algorithm generates a time-inhomogeneous Markov chain 
that admits a distribution n t as the invariant distribution at iteration t. The biased distribution 7f t 
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targeted by the algorithm at iteration t is based on the target distribution tt, and modified such that 
a) the generated chain visits all the sets (Xi)f =1 equally, that is the proportion of visits in each set is 
converging to c? _1 when t goes to infinity; and b) the restriction of the modified distribution 7j t to each set 
Xi coincides with the restriction of the target distribution tt to this set, up to a multiplicative constant. 
The modification (a) is crucial, as inducing uniform exploration of the sets is the biasing mechanism 
which improves exploration; in fact similar strategies are used in other fields, including combinatorial 
optimization (Wei et al., 2004). Ideally the biased distribution tt would not depend on t, and would be 
available analytically as: 

where ip(i) = J x ir(x)dx and XxJyX) is equal to 1 if x G X \ and otherwise. Checking that using tt 
as the invariant distribution of a MCMC algorithm would validate points a) and b) is straightforward. 
Figure 1 illustrates a univariate target distribution tt and its corresponding biased distribution tt under 
two different partitions of the state space. 

Original Density, with partition lines Biased by X Biased by Log Density 




10 15 -5 5 10 15 



Figure 1: Probability density functions for a univariate distribution 7r and its biased version tt when 
partitioning the state space along the x-axis = x, middle) and the log density (£(x) = — log7r(a;), 

right). The left-most plot also shows the partitioning of the state space with £(x); in both cases d = 20. 
The biasing is done such that the integral J x 7r(x)dx is the same for all Xi (areas under the curve for 
each set) and such that tt and tt coincide on each set Xi, up to a multiplicative constant. 

In practical situations, however, the integrals (ip(i))f = i are not available, hence we wish to plug 
estimates (9(i))f =1 of (ip{i))f = i into Equation (1). The Wang-Landau algorithm is an iterative algorithm 
which jointly generates a sequence of estimates (&t(i))t for all i and a Markov chain (X t )t, such that 
when t goes to infinity, 9t(i) converges to tp(i) and consequently, the distribution of X t converges to tt. 
We denote by TTg t the biased distribution obtained by replacing by its estimate 9t{i) in Equation (1). 
Note that the normalizing constant of TTg t is now unknown. A simplified version of the Wang-Landau 
algorithm is given in Algorithm 1. 
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Algorithm 1 Simplified Wang-Landau Algorithm 
l: Partition the state space into d regions {X\, ■ ■ . , X^} along a reaction coordinate £(x). 
2: First, Vi € {1, . . . , d} set 9(i) 4- 1. 
3: Choose a decreasing sequence {7*}, typically j t = lA- 
4: Sample Xq from an initial distribution ttq. 
5: for t = 1 to T do 

6: Sample X t from Pg t l (X t _ x, •), a transition kernel with invariant distribution ng t _ 1 (x). 
7: Update the bias: log0 t (£) 4- log0 t _i(i) +-f t (l Xt (X t ) - d" 1 ). 

8: Normalize the bias: 9 t (i) 4- O t (i)/Yu=i °t(i)- 
9: end for 



The rationale behind the update of the bias is that if the chain is in the set X il the probability of 
remaining in Xj should be reduced compared to the other sets through an increase in the associated bias 
t (i). Therefore the chain is pushed towards the sets that have been visited less during the previous 
iterations, improving the exploration of the state space so long as the partition (Xi)f =1 is well chosen. 
While this biasing mechanism adds cost to each iteration of the algorithm the tradeoff is improved 
exploration. In step 6, the transition kernel is typically a Metropolis-Hastings move, due to the lack of 
conjugacy brought about by biasing. 

In this simplified form the Wang-Landau algorithm reduces to standard stochastic approximation, 
where the term j t decreases at each iteration. The algorithm as given in Wang and Landau (2001a, b) 
uses a more sophisticated learning rate jt which does not decrease deterministically, but instead only 
when a certain criterion is met. This criterion, referred to as the "flat histogram" criterion, is met when 
for alH € {1, . . . , d}, v(i) is close enough to c?" 1 , where we denote by v(i) the proportion of visits of (Xt) 
in the set Xi since the last time the criterion was met. Hence we introduce a real number c to control 
the distance between v(i) and and an integer k to count the number of criteria already met. We 
describe the generalized Wang-Landau in Algorithm 2. 



Algorithm 2 Wang-Landau Algorithm 
1: Partition the state space into d regions {Xi, . . . , Xj} along a reaction coordinate £(x). 
2-. First, Vi e {1, . . . , d} set 0(i) 4- 1, u(i) 4- 0. 
3: Choose a decreasing sequence {7^}, typically 7fe = 1/fc. 
4: Sample Xq from an initial distribution ttq. 
5: for t = 1 to T do 

6: Sample X t from Pg t l (X t ^i, •), a transition kernel with invariant distribution ng t l (x). 

7: Update the proportions: Vi € {1, d} v(i) «— | [(t — +2#.(X t )]. 

8: if "flat histogram": max ie [ l d j \v(i) — < c/d then 

9: Set k 4- k + 1. 
10: Reset Vi G {1, d} v(i) 4- 0. 
11: end if 

12: Update the bias: log 9 t (i) 4- log0 t _i(i) +<y k (Z Xi (X t ) - d~ r ). 
13: Normalize the bias: 9 t (i) 4- 9t(i)/Y,i=i °t(i)- 
14: end for 



When c is set to low values (e.g. c = 0.1 or 0.5), the algorithm must explore the various regions 
such that the frequency of visits to the region Xi is approximately d^ 1 before the learning rate 7^ is 
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decreased. Also, the algorithm may be further generalized to target a desired frequency fa instead of the 
same frequency d _1 for every set; while such strategies may be useful, as demonstrated in the following 
section, for notational simplicity we focus on the case fa — dr 1 . As already mentioned, to answer 
the general question of exploring the support of a target density 7r, the default choice for the reaction 
coordinate is the energy function: £(x) = — log7r(cc), which has the benefit of being one-dimensional 
regardless of the dimension of the state space X . However, for specific models other reaction coordinates 
have been used, such as one (or more) of the components Xj of a; or a linear combination of components 
of x. In the applications in Section 4 we discuss the use of alternative reaction coordinates further. 
We now propose improvements to the Wang-Landau algorithm to increase its flexibility and efficiency. 

3.2 A Novel Adaptive Binning Strategy 

The Wang-Landau and equi-energy sampler algorithms are known to perform well if the bins, or partitions 
of the one-dimensional reaction coordinate £(x), are well chosen. However, depending on the problem it 
might be difficult to choose the bins to optimize sampler performance. A typical empirical approach to 
deal with this issue is to first run, for example, an adaptive MCMC algorithm to find at least one mode 
of the target distribution. The generated sample and the associated target density evaluations determine 
a first range of the target density values which can be used to initialize the bins. At this point the user 
can choose a wider range of target density values (e.g. by multiplying the range by 2), in order to allow 
for a wider exploration of the space. Within this initial range, one must still decide the number of bins. 

Due to difficulties with selecting the bins, it has been suggested that one should adaptively compare 
adjacent bins, splitting a bin if the corresponding estimate 9 is significantly larger than a neighboring value 
(Schmidler, 2011). Because each ijii is a given bin's normalizing constant, we feel it is more important to 
maintain uniformity within a bin to allow easy within-bin movement. Our proposed approach to achieve 
this "flatness" is to look at the distribution of the realized reaction coordinate values within each bin. 
Figure 2 illustrates this distribution on an artificial histogram. The plot of Figure 2(a) shows a situation 
where, within one bin, the distribution might be strongly skewed towards one side. In this artificial 
example, very few points have visited the left side of the bin, which suggests that moving from this bin 
to the left neighboring bin might be difficult. 

We propose to consider the ratio of the number of points on the left side of the middle (dashed 
line) over the number of points within the bin as a very simple proxy for the discrepancy of the chains 
within one bin (see e.g. Niederreiter (1992) for much more sophisticated discrepancy measures). In a 
broad outline, if this ratio was around 50%, the within-bin histogram would be roughly uniform. On the 
contrary, the ratio corresponding to Figure 2(a) is around 7%. Our strategy is to split the bin if this ratio 
goes below a given threshold, say 25%; two new bins are created, corresponding to the left side and the 
right side of the former bin, and each bin is assigned a weight of 9/2 where 9 is the weight of the former 
bin. These provide starting values for the estimation of the weight of the new bins during the following 
iterations of the algorithm. Note also that the desired frequency of visits to each of the new bins, which 
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(a) Unique bin before the split 



(b) Two bins after the split 



Figure 2: Artificial histograms of the log target density values associated to the chains generated by 
the algorithm, within a single bin (left) and within two bins created by splitting the former bin (right). 



was for instance equal to 1/d before the split, has to be specified as well. In the numerical experiments, 
we set the desired frequency of the new bins as one half of the desired frequency of the former bin. Figure 
2(b) shows the distribution of samples within the two new bins. The resulting histogram is not uniform, 
yet exhibits a more even distribution within the bin - a feature which is expected to help the chain to 
move from this bin to the left neighboring bin. The threshold could be set closer to 50%, which would 
result in more splits and therefore more bins. 

In practice it is not necessary to check whether the bins have to be split at every iteration. Our 
strategy is to check every n-th iteration, until the flat histogram criterion is met for the first time. When 
it is met, it means that the chains can move easily between the bins, and hence the bins can be kept 
unchanged for the remaining iterations. Finally, when implementing the automatic binning strategy for 
discrete distributions, one must ensure that a new bin corresponds to actual points in the state space. 
For example if the bins are along the energy values and the state space is finite, there are certainly 
intervals of energy to which no states corresponds, and that would therefore never be reached. Section 
4 demonstrates the proposed adaptive binning strategy in practice. 

In addition to allowing for splitting of bins, it is also important to allow the range of bins to extend if 
new states are found outside of the particular range. That said, one must differentiate between the two 
extremes of the reaction coordinate. For example, if £(x) — — log7r(a;), then one might not wish to add 
more low-density (high-energy) bins, which would induce the sampler to explore further and further into 
the tails. However, if one finds a new high-density (low-energy) mode beyond the energy range previously 
seen, then the sampler might become stuck in this new mode. In this case, we propose to extend the 
first bin corresponding to the lowest level of energy to always include the lowest observed values. The 



adaptive partition (Xi,t)iLi of the state space takes the following form at time t: 

X\.t = [emin,t> e l,t]i <*2,i = [ e l,t! e 2,t]j ■■■Xd t ,t = [ e d t -l,t> +°°) 

where e m i n .t = min t >o{— log7r(X t )} and (e^t, . . . , e ( i t -i.t) defines the limits of the inner bins at time t, 
which is the result of initial bin limits [&i,o, ■ ■ ■ , e<j -i,o)) an d possible splits between time and time t. 
As such, if new low-energy values are found, the bin Xi t is widened. If this results in unequal exploration 
across the reaction coordinate, then the adaptive bin-splitting mechanism will automatically split this 
newly widened bin. 

3.3 Parallel Interacting Chains 

We propose to generate multiple chains instead of a single one to improve computational scalability 
through parallelization as well as particle diversity. The use of interacting chains has become of much 
interest in recent years, with the multiple chains used to create diverse proposals (Casarin et al., 2011), to 
induce long-range equi-energy jumps (Kou et al., 2006), and to generally improve sampler performance; 
see Atchade et al. (2011), Brockwell et al. (2010), and Byrd (2010) for recent developments. The use of 
parallelization is not constrained to multiple chains, however, and has also been employed to speed up 
the generation of a single chain through pre-fetching (Brockwell, 2006). 

Let N be the desired number of chains. We follow Algorithm 2, with the following modifications. 
First we generate N starting points Xq = (Xq 1 ^ , . . . , X^ N ^) independently from an initial distribution ttq 
(Algorithm 2 line 4). Then at iteration t, instead of moving one chain using the transition kernel Pe t _ 1 , 
we move the N chains using the same transition kernel, associated with the same bias $t—\ (Algorithm 
2 line 6). We emphasize that the bias 9 is common to all chains, which makes the proposed method 
different from running Wang-Landau chains entirely in parallel. The proportions are updated using 
all the chains, simply by replacing the indicator function 2%. (X t ) by the mean TV -1 YljLi -^*« C^t ); that 
is the proportion of chains currently in set Xi (Algorithm 2 line 7). Likewise the update of the bias uses 
all the chains, again replacing the indicator function by the proportion of chains currently in a given set 
(Algorithm 2 line 12). We have therefore replaced indicator functions in the Wang-Landau algorithm by 
the law of the MCMC chain associated with the current parameter. Since this law is not accessible, we 
perform a mean field approximation at each time step. A similar expression has recently been employed 
by Liang and Wu (2011) for use in parallelizing the stochastic approximation Monte Carlo algorithm 
(Liang et al., 2007; Liang, 2009). Note that while we have designed the chains to communicate at each 
iteration, such frequent message passing can be costly, particularly on graphics processing units. In such 
situations, one could alter the algorithm such that the chains only communicate periodically. 

Our results (see Section 4) show that N interacting chains run for T iterations can strongly outperform 
a single chain run for N xT iterations, in terms of variance of the resulting estimates. Specifically, having 
a sample approximately distributed according to irg t (x) instead of a single point at iteration t improves 
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and stabilizes the subsequent estimate 8t+i- We explore the tradeoff between N and T in more detail in 
Section 4. 

Note that, while the original single-chain Wang-Landau algorithm was not straightforward to paral- 
lelize due to its iterative nature, the proposed algorithm can strongly benefit from multiple processing 
units: at a given iteration the N move steps can be done in parallel, as long as the results are consequently 
collected to update the bias before the next iteration. Therefore if multiple processors are available, as 
e.g. in recent central processing units and in graphics processing units (see e.g. Lee et al. (2010); Suchard 
and Rambaut (2009)), the computational cost can be reduced much more than what was possible with 
the single-chain Wang-Landau algorithm. To summarize, the proposed use of interacting chains can 
both improve the convergence of the estimates, regardless of the number of available processors, and 
additionally benefit from multiple processors. 

Finally, an additional benefit of using N parallel chains is that they can start from various points, 
drawn from the initial distribution ttq] hence if ttq is flat enough, the chains can start from different 
local modes, which improves de facto the exploration. However, we show in Section 4 that the chains 
still explore the space even if they start within the same mode, and hence the efficiency of the method 
does not rely on the choice of 7r , contrary to what we observed with sequential Monte Carlo methods. 
Additionally, because the sampler is attempting to explore both the state-space as well as the range of 
the reaction coordinate simultaneously, our parallel formulation allows the sampler to borrow strength 
between chains, providing for exploration of the reaction coordinate without having to move a single chain 
across potentially large and high-dimensional state-spaces to traverse the reaction coordinate values. 

3.4 Adaptive Proposal Mechanism 

As discussed earlier, it is important to automate the proposal mechanism to improve movement across 
the state space. A well-studied proxy for optimal movement is the algorithm's Metropolis-Hastings 
acceptance rate. Too low an acceptance rate signifies the algorithm is attempting to make moves that 
are too large, and are therefore rejected. Too high an acceptance rate signifies the algorithm is barely 
moving. As such, we suggest adaptively tuning the proposal variance to encourage an acceptance rate of 
0.234 as recommended in Roberts et al. (1997), although we have found settings in the range 0.1 to 0.5 to 
work well in all examples tested. The Robbins-Monro stochastic approximation update of the proposal 
standard deviation o~ t is as follows: 

a t+l =a t + Pt (21(A > 0.234) - 1) (2) 

where t is the current iteration of the algorithm, p t is a decreasing sequence (typically p t — l/t), and 
A is the acceptance rate (proportion of accepted moves) of the particles. Through this update, the 
proposal variance grows after samples are accepted, and shrinks when samples are rejected, encouraging 
exploration of the state space. 
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Another approach to adaptively tuning the proposal distribution is to use the following mixture of 
Gaussian random- walks: 

X* ~ Wl Af (x t -!, ^^S t ) + w 2 U (x t -x, ^f-Jp) (3) 

with w\ + W2 — 1, the empirical covariance of the chain history - an estimator of the covariance 
structure of the target - and I p the p x p identity matrix where p is the dimension of the target space. 
The first component of this mixture makes the proposal adaptive and able to learn from the past, while 
the second component helps to explore the space. For instance, if the chain is stuck in a mode, the first 
component's variance might become small, yet the second component guarantees a chance to eventually 
escape the mode. Hence the second component acts as a "safety net" and therefore its weight is small, 
typically u> 2 = 0.05, and its standard deviation aj may be set large to improve mixing (Guan and Krone, 
2007). 

In our context where parallel chains are run together, we use all the chains to estimate the empirical 
covariance S t at each iteration. Note that the computation of this covariance does not require the 
storage of the whole history of the chain and can be done at constant cost, since recurrence formulae 
exist to compute the empirical covariance, as explained for instance in Welford (1962). The value 2.38 2 is 
justified by asymptotic optimality reflections on certain classes of models (see, e.g., Roberts et al. (1997) 
and Roberts and Rosenthal (2009)). 

3.5 Using the Output to Perform Inference 

While the resulting samples from the proposed algorithm PAWL are not from n, but rather an approxi- 
mation of the biased version (1), one can use importance sampling or advanced sequential Monte Carlo 
ideas to transition the samples to ir (see Chopin and Jacob (2010) for details). Alternatively, the samples 
from the exploratory algorithm can be used to seed a more traditional MCMC algorithm, as advocated 
by Schmidlcr (2011). 

The pseudo-code for PAWL, combining the parallel Wang-Landau algorithm with adaptive binning 
and proposal mechanisms, is given in the Appendix. Before proceeding to examples, it is important to 
reiterate the importance of the values ip(i) — J x it(x)dx. Specifically, certain choices of the reaction 
coordinate (,(x) result in having inherent value. For example, it is possible in a model selection 
application to use the model order as £(x), in which case the values could be employed to calculate 
Bayes factors and other quantities of interest. 

4 Applications 

We now demonstrate PAWL applied to three examples including variable selection, mixture modeling, 
and spatial imaging. A fourth pedagogical example is available in the appendices. In each application 
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we walk through our proposed algorithm (described explicitly as Algorithm 3 in the Appendix), first 
running preliminary (adaptive) Metropolis-Hastings MCMC to determine the initial range for the reaction 
coordinate £ and initial values for the proposal parameters and starting state of the interacting Wang- 
Landau chains. This range is then increased to encourage exploration of low-density regions of the space, 
and an initial number of bins is specified. Once this groundwork is set, the same Metropolis-Hastings 
algorithm run in the preliminary stage is embedded within the PAWL algorithm. 

4.1 g-Prior Variable Selection 

We proceed by conducting variable selection on the pollution data set of McDonald and Schwing (1973), 
wherein mortality is related to pollution levels through 15 independent variables including mean annual 
precipitation, population per household, and average annual relative humidity. Measured across 60 
metropolitan areas, the response variable y is the age-adjusted mortality rate in the given metropolitan 
area. Our goal is to identify the pollution-related independent variables which best predict the response. 
With 15 variables, calculating the posterior probabilities of the 32, 768 models exactly is possible but 
time-consuming. We have chosen this size of data set to provide for difficult posterior exploration, yet 
allow a study of convergence of 9 towards ip. 

With an eye towards model selection, we introduce the binary indicator variable 7 € {0, 1} P , where 
7j = 1 means the variable Xj is included in the model. As such, 7 can describe all of the 2 P possible 
models. Consider the normal likelihood 

y| M ,X,/3,a 2 ~iV„(X/3,a 2 I n ). (5) 

If X 7 is the model matrix which excludes all Xj's if "fj — 0, we can employ the following prior 
distributions for (3 and a 1 (Zellner, 1986; Marin and Robert, 2007): 



7r(/3 7 ,a 2 | 7 ) oc ( CT 2)-(^+i)/2-i ex p 



1 qTz-v-T 



2ga 



2 ^7 (-^-7 X 7 )/3 7 



where q 7 = 1^7 represents the number of variables in the model. While selecting g can be a difficult 
problem, we have chosen it to be very large {g = exp(20)) to induce a sparse model, which is difficult to 
explore due to the small marginal probabilities of most variables. After integrating over the regression 
coefficients (3, the posterior density for 7 is thus 



7r(7|y,X) cx ( 5 + l) 



-(S-y+l)/2 



/2 

y T y - -4t y T x 7 (x^x 7 )- 1 x 7 y 

g + l 7 



(6) 



While we select the log energy function — log7r(x) as the reaction coordinate for our analysis, it 
is worth noting that many other options exist. For instance, it would be natural to consider the model 
saturation q-y/p, which would ensure exploration across the different model sizes. However, we select 
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Figure 3: Variable selection example: convergence of Wang-Landau for N = 1, 10, 100. Iterations set 
such that each algorithm runs in 2 minutes (±5 seconds). 9 for each bin shown in solid lines. True values 
(■0) shown as dotted lines. 

£(x) — — log 7r(cc) to emphasize the universality of using the energy function as the reaction coordinate. 

We first run a preliminary Metropolis-Hastings algorithm which flips a variable on/off at random, 
accepting or rejecting the flip based on the resulting posterior densities. Due to high correlation between 
variables, a better strategy might be to flip multiple variables at once; however, we restrain from exploring 
this to demonstrate PAWL's ability to make viable even poorly designed Metropolis-Hastings algorithms. 
The preliminary algorithm run found values 377 < — log7r(a:) < 410, which we extend slightly to create 
20 equally spaced bins in the range [377, 450] . It is worth reiterating that the resulting samples generated 
from PAWL are from a biased version of 7r; as such, importance sampling techniques could be used to 
recover n, or the samples obtained could be used to seed a more traditional MCMC algorithm. 

Due to the size of the problem, we are able to enumerate all posterior values, and hence may calcu- 
late ip exactly. As such, we begin by examining the effect of the number of particles N on the parallel 
Wang-Landau algorithm. To further focus on this aspect, we suppress adaptive binning and proposals for 
this example. Figure 3 shows the convergence of 9 to \& for N = 1, 10, 100. We see that the algorithm's 
convergence improves with more particles. Using N — 100 particles, we now examine PAWL compared 
to the Metropolis-Hastings algorithm (run on N chains) mentioned above on the unnormalized targets 
7r, 7T 1 / 10 , 7T 1 / 100 . Consider Figure 4; on the target distribution n, the Metropolis-Hastings algorithm be- 
comes stuck in high-probability regions. However, on the tempered distributions, the algorithm explores 
the space more thoroughly, although not to the same level as PAWL. Specifically, PAWL explores a much 
wider range of models, including the highest probability models, whereas the tempered distributions do 
not. Here the Wang-Landau algorithm as well as the Metropolis-Hastings algorithm both use N = 100 
chains for T = 3500 iterations, the former taking 253 ± 13 seconds and the latter taking 247 ± 15 seconds 
across 10 runs, indicating that the additional cost for PAWL is negligible. 



14 



Wang- Landau 


Metropolis-Hastings, Temp = 1 






■ 














III! I III wi 


whit 


HIT H|hh 






Metropolis-Hastings, Temp = 10 Metropolis-Hastings, Temp = 100 






mmS ^ m 








- 








500 1000 1500 2000 


2500 3000 3500 500 1000 1500 2000 2500 3000 3500 

Iteration 



Figure 4: Variable selection example: exploration of model space by PAWL and Metropolis-Hastings at 
3 temperature settings, (a) Model Saturation (proportion of non-zero variables in model) as function of 
algorithm iterations. The solid lines are the mean of N = 100 chains, while the shaded regions are the 
middle 95% of the chains. 
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4.2 Mixture Modeling 

Mixture models provide a challenging case of multimodality, due partly to the complexity of the model 
and partly to a phenomenon called "label switching" (see e.g. Fruhwirth-Schnattcr (2006) for a book 
covering Bayesian inference for these models, Diebolt and Robert (1994) and Richardson and Green 
(1997) for seminal papers using MCMC for mixture models, and Stephens (2000) and Jasra et al. (2005) 
on the label switching problem). Articles describing explorative MCMC algorithms often take these 
models as benchmarks, as e.g. population MCMC and SMC methods in Jasra et al. (2007), the Wang- 
Landau algorithm in Atchade and Liu (2010), free energy methods in Chopin et al. (2012); Chopin and 
Jacob (2010), and parallel tempering with equi-energy moves in Baragatti et al. (2012). 
Consider a Bayesian Gaussian mixture model, i.e. for i = 1, . . . n, 

K 

p(y t \q,fi,X) = ^qkifiy^fii^X^ 1 ), 

k=l 

where (p is the probability density function of the Gaussian distribution, K is the number of components, 
qk, Hk and are respectively the weight, the mean and the precision of component k. The component 
index k is also called its label. Following Richardson and Green (1997), the prior is taken as, for 
k = l,...,K, 

/ifc ~ N(Af, K~ ), At ~ Gamma(a,/3), 

/3 ~ Gamma(j, h), (qi, . . . , qx-i) ~ Dirichletj<-(1, . . . , 1) 

with, e.g., k = 4/i? 2 , a = 2, g = 0.2, h = lOOg/aR 2 , M — y, R =range(j/). 

The invariance of the likelihood to permutations in the labelling of the components leads to the "label 
switching" problem: since there are K\ possible permutations of the labels, each mode has necessarily 
K\— 1 replicates. We emphasize that this model has been thoroughly studied and is hence well-understood 
from a modeling point of view, but it still induces a computationally challenging sampling problem for 
which difficulty can be artificially increased through the number of components K. 

Note that in this parametrization /?, the rate of the Gamma prior distribution of the precisions 
is estimated along with the parameters of interest qi-K—ii I^i-.k and Xi.k- Chopin et al. (2012); Chopin 
and Jacob (2010) suggest that /3 can be used as a reaction coordinate, since a large value of /3 results 
in a small precision and hence in a flatter posterior distribution of the other parameters, which is easier 
to explore than the distribution associated with smaller values of /3; we refer to these articles for further 
exploration of this choice of reaction coordinate, and instead default to £(a?) = — log7r(x). 

We create a synthetic 100-sample from a Gaussian mixture with k = 4 components, weights 1/4, 
means —3, 0, 3, 6 and variances 0.55 2 as in Jasra et al. (2005). The goal is to explore the highly 
multimodal posterior distribution of the 13-dimensional parameter 8 — (wi-a, /j,i-a, \ia, /?) where Wk is 
the unnormalized weight: qk = W k/Ylk=i Wk ~ Unnormalized weights may be handled straightforwardly 
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in MCMC algorithms since they are defined on K + and not on the A"-simplex as with the q^. 

The proposed algorithm is compared to a Sequential Monte Carlo sampler (SMC) and a parallel 
adaptive Metropolis-Hastings (PAMH) algorithm, that we detail below. We admittedly use naive versions 
of these competitors, arguing that most improvements of these could be carried over to PAWL. For 
instance, a mixture of Markov kernels as suggested for the SMC algorithm in Section 3.2 of Jasra et al. 
(2007) can be used in the proposal distribution of PAWL; and since PAWL is a population MCMC 
algorithm, exchange and crossover moves could be used as well, as suggested for the Population SAMC 
algorithm in Liang and Wu (2011). To get a plausible range of values for the reaction coordinate of the 
proposed algorithm without user input, an initial adaptive MCMC algorithm is run with N = 10 chains 
and T lmt = 1, 000 iterations. The initial points of these chains are drawn from the prior distribution of 
the parameters. This provides a range of log density values, from which we compute the 10% and 90% 
empirical quantiles, denoted by gio and qgo respectively. In a conservative spirit, the bins are chosen to 
equally divide the interval [qw, qio + 2(<7oo — <Zio)] m 20 subsets. Hence the algorithm is going to explore 
log density values in a range that is approximately twice as large as the values initially explored. Note 
that we use quantiles instead of minimum and maximum values to make the method more robust. 

Next, PAWL itself is run for T = 200, 000 iterations, starting from the terminal points of the N 
preliminary chains, resulting in a total number of N(T + Tj n i t ) target density evaluations. In this 
situation, even with only 100 data points, most of the computational cost goes into the evaluation of 
the target density. This confirms that algorithmic parameters such as the number of bins does not 
significantly affect the overall computational cost, at least as long as the target density is not extremely 
cheap to evaluate. The adaptive proposal is such that it targets an acceptance rate of 23.4%. Meanwhile 
the PAMH algorithm using the same adaptive proposal is run with N = 10 chains and T* = 250, 000 
iterations, hence relying on more target density evaluations for a comparable computational cost. 

Finally, the SMC algorithm is run on a sequence of tempered distribution (TTk)k=ii each density being 
defined by: 

ir k (x) oc 7T^* (x)pQ k (x) 

where po is an initial distribution (here taken to be the prior distribution), and = k/K. The number of 
steps K is set to 100 and the number of particles to 40, 000. When the Effective Sample Size (ESS) goes 
below 90%, we perform a systematic resampling and 5 consecutive Metropolis-Hastings moves. We use a 
random walk proposal distribution, which variance is taken to be cE where £ is the empirical covariance 
of the particles and c is set to 10%; see Jasra et al. (2007) for more details. The parameters are chosen 
to induce a computational cost comparable to the other methods. However for the SMC sampler the 
number of target density evaluations is a random number, since it depends on the random number of 
resampling steps: the computational cost is in general less predictable than using MCMC. 

First we look at graphical representations of the generated samples. Figure 5 shows the resulting 
points projected on the (pi, /12) plane, restricted on [—5, 9] 2 . In this plane there are 12 replicates of each 
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mode, indicated by target symbols in Figure 5(a). These projections do not allow one to check that 
all the modes were visited since they project the 13-dimensional target space on a 2-dimensional space. 
Figure 5(b) shows that the adaptive MCMC method clearly misses some of the modes, while visiting 
many others. Figure 5(c) shows how the chains generated by the modified Wang-Landau algorithm easily 
explore the space of interest, visiting both global and local modes. To recover the main modes from these 
chains, we use the final value of the bias, 6t, as importance weights to correct for the bias induced by 
the algorithm; in Figure 5(c) the importance weights define the transparency of each point: the darker 
the point, the more weight it has. Finally Figure 5(d) shows how the SMC sampler also put particles in 
each mode; again the transparency of the points is proportional to their weights. 

We now turn to more quantitative measures of the error made on marginal quantities. Since the 
component means admit 4 identical modes around —3, 0, 3 and 6, we know that their overall mean is 
approximately equal to fi* = 1.5. We then compute the following error measurement: 



where jlk is the mean of the generated sample (taking the weights into account for PAWL and SMC). 
Table 1 shows the results averaged over 10 independent runs: the means of each component, the error 
defined above and the (wall-clock) run times obtained using the same CPU, along with their standard 
deviations. The results highlight that in this context PAWL gives more precise results than SMC, for 
the same or less computational cost; the comparison between parallel MCMC and SMC confirms the 
results obtained in Jasra et al. (2007). The small benefit of PAWL over PAMH can be explained by 
considering the symmetry of the posterior distribution: even if some modes are missed by PAMH as 
shown in Figure 5(b), the approximation of the posterior expectation might be accurate, though the 
corresponding variance will be higher. 



Method 


Mi 


M2 


M3 


fi 4 


Error 


Time (s) 


PAWL 
PAMH 
SMC 


1.42 ± 0.99 
1.58 ± 0.81 
1.00 ± 1.96 


1.42 ± 0.58 
1.25 ± 0.72 
2.99 ± 1.38 


1.39 ± 0.90 
1.04 ± 1.07 
0.92 ± 2.27 


1.75 ± 0.78 

2.09 ± 1.00 

1.10 ± 2.11 


1.50 ± 0.59 
1.75 ± 0.80 
3.89 ± 1.34 


209 ± 1 
233 ± 1 
269 ± 7 



Table 1: Estimation of the means of the mixture components, for the proposed method (PAWL), Parallel 
Adaptive Metropolis-Hastings (PAMH) and Sequential Monte Carlo (SMC), using the prior as initial 
distribution. Quantities averaged over 10 independent runs for each method. 

Next we consider a more realistic setting, where the initial distribution is not well spread over the 
parameter values: instead of taking the prior distribution itself, we use a similar distribution but with 
an hyperparameter k equal to 1 instead of 4/i? 2 , which for our simulated data set is equal to 0.03. This 
higher precision makes the initial distribution concentrated on a few modes, instead of being fairly flat 
over the whole region of interest. We keep the prior unchanged, so that the posterior is left unchanged. 
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For PAWL and PAMH, this means that the initial points of the chains are all close one to another; and 
likewise for the initial particles in the SMC sampler. The results are shown in Table 2, and illustrate 
the degeneracy of SMC when the initial distribution is not well-chosen; though this is not surprising, 
this is important in terms of exploratory algorithms when one does not have prior knowledge of the 
region of interest. Both parallel MCMC methods give similar results as with the previous, flatter initial 
distribution. 



Method 


Mi 


M2 


M3 


H4 


Error 


Time 


PAWL 
PAMH 

SMC 


1.16 ± 0.75 
1.37 ± 0.73 
0.35 ± 2.13 


2.04 ± 0.50 
1.48 ± 1.39 
0.82 ± 1.55 


1.72 ± 0.80 
1.71 ± 0.81 
3.19 ± 2.41 


1.07 ± 1.22 
1.44 ± 1.11 
1.62 ± 1.85 


1.48 ± 1.10 
1.75 ± 1.01 
4.17 ± 1.41 


210 ± 1 
234 ± 1 
337 ± 8 



Table 2: Estimation of the means of the mixture components, for the proposed method (PAWL), Parallel 
Adaptive Metropolis-Hastings (PAMH) and Sequential Monte Carlo (SMC), using a concentrated initial 
distribution. Quantities averaged over 10 independent runs for each method. 

Finally, we compare different algorithmic settings for the PAWL algorithm, changing the number of 
chains and the number of iterations. The results are shown in Table 3. First we see that, even on a 
single CPU, the computing time is not exactly proportional to N x T, the number of target density 
evaluation. Indeed the computations are vectorized by iteration, and hence it is typically cheaper to 
compute one iteration of N chains than TV iterations of 1 chain; although this would not hold for every 
model. We also see that the algorithm using only one chain failed to explore the modes, resulting in a 
huge final error. Finally we see that with 50 chains and only 50, 000 iterations, the algorithm provides 
results of approximately the same precision as with 10 chains and 200, 000 iterations. This suggests 
that the algorithm might be particularly interesting if parallel processing units are available, since the 
computational cost would then be much reduced. 



Parameters 


Mi 


M2 




M4 


Error 


Time 


N = 1 

T = 5 x 10 5 


0.37 ± 3.46 


2.01 ± 3.27 


2.53 ± 3.04 


0.95 ± 3.46 


6.39 ± 1.30 


265 ± 40 


N = 10 

T = 2 x 10 5 


1.42 ± 0.99 


1.42 ± 0.58 


1.39 ± 0.90 


1.75 ± 0.78 


1.50 ± 0.59 


209 ± 1 


N = 50 

T = 5 x 10 4 


1.51 ± 0.88 


1.5 ± 0.9 


1.65 ± 0.64 


1.31 ± 0.31 


1.22 ± 0.69 


178 ± 2 



Table 3: Estimation of the means of the mixture components, for the proposed method (PAWL), for 
different values of N, the number of chains, and T, the number of iterations. Quantities averaged over 
10 independent runs for each set of parameters. 

4.3 Spatial Imaging 

We finish our examples by identifying ice floes from polar satellite images as described in Banfield and 
Raftery (1992). Here the image under consideration is a 200 by 200 gray-scale satellite image, with focus 
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(a) Locations of the global modes of the posterior (b) Projection of the chains generated by the parallel 
distribution projected on (li\,li2) adaptive MH algorithm on (111,1x2) 
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(c) Projection of the chains generated by PAWL on (d) Projection of the particles generated by the SMC 
(/J,!, algorithm on (m, li 2 ) 

Figure 5: Mixture model example: exploration of the posterior distribution projected on the //i,//2 
plane, using different algorithms. 
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Figure 6: Spatial model example: (a) original ice floe image with highlighted region, (b) close-up of 
focused region. 

on a particular 40 by 40 region (y, Figure 6); the goal is to identify the presence and position of polar 
ice floes (x). Towards this goal, Higdon (1998) employs a Bayesian model. Basing the likelihood on 
similarity to the image and employing an Ising model prior, the resulting posterior distribution is 



The first term, the likelihood, encourages states x which are similar to the original image y. The second 
term, the prior, favors states x for which neighbouring pixels are equal. Here neighbourhood (~) is 
defined as the 8 vertical, horizontal, and diagonal adjacencies of each (interior) pixel. 

Because the prior strongly prefers large blocks of identical pixels, an MCMC method which proposes 
to flip one pixel at a time will fail to explore the posterior, and hence Higdon (1998) suggests a partial 
decoupling technique specific to these types of models. However, to demonstrate PAWL's power and 
universality, we demonstrate its ability to make simple one-at-a-time Metropolis-Hastings feasible in 
these models without more advanced decoupling methods. 

First running a preliminary Metropolis-Hastings algorithm of length 20, 000, we use the range of 
explored energy values divided evenly across 10 bins. The algorithm subsequently splits bins 6 times (with 
splitting stopped once the algorithm reaches the extremes of the reaction coordinate values) resulting in 
17 bins at the algorithm's conclusion. For both algorithms, we run 10 chains for 1, 000, 000 iterations with 
model parameters a = 1, ft = 0.7. Due to the flip-one-pixel approach, we suppress adaptive proposals for 
this example. In contrast to the mixture modeling example, in this example the target density is fairly 
straightforward to calculate, so it is a good worst-case comparison to demonstrate the additional time 
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Figure 7: Spatial model example: states explored over 400,000 iterations for Metropolis-Hastings (top) 
and proposed algorithm (bottom). 

taken by the proposed algorithm. For this example, the MH algorithm took 388 ± 21 seconds across 10 
runs, whereas PAWL required 478 ± 24 seconds. Thus in this case the Wang-Landau adds a 23% price to 
each iteration on average. However, as we will show, the exploration is significantly better, justifying the 
slight additional cost. Figure 7 shows a subset of the last 400, 000 posterior realizations from one chain 
of each algorithm. We see that the proposed Wang-Landau algorithm encourages much more exploration 
of alternate modes. The corresponding average state explored over all 10 chains (after 400, 000 burn- 
in) is shown in Figure 8. From this we see that Wang-Landau induces exploration of the mode in 
the top-left of the region in question, as well as a bridge between the central ice floes. In conclusion, 
while flip-one-pixel Metropolis-Hastings is incapable of exploring the modes in the posterior caused by 
the presence/absence of large ice floes, the proposed algorithm encourages exploration of these modes, 
even in the presence of high between-pixel correlation. While Higdon (1998) develops a custom-tailored 
MCMC solution to overcome the inability of Metropolis-Hastings to adequately explore the posterior 
density in Ising models, we employ PAWL - a general-purpose automatic density exploration algorithm 
- to achieve similar results. 



5 Discussion and Conclusion 

The proposed algorithm, PAWL, has at its core the Wang-Landau algorithm which, despite wide-spread 
use in the physics community, has only recently been introduced into the statistics literature. A well- 
known obstacle in implementing the Wang-Landau algorithm is selecting the bins through which to 
discretize the state space; in response, we have developed a novel adaptive binning strategy. Additionally, 
we employ an adaptive proposal mechanism to further reduce the amount of user-defined parameters. 
Finally, to improve the convergence speed of the algorithm and to exploit modern computational power, 
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Figure 8: Spatial model example: average state explored with Metropolis-Hastings (left) and PAWL 
after importance sampling (right). 

we have developed a parallel interacting chain version of the algorithm which proves efficient in stabilizing 
the algorithm. Through a host of examples, we have demonstrated the algorithm's ability to conduct 
density exploration over a wide range of distributions continuous and discrete. While a suite of custom- 
purposed MCMC tools exist in the literature for each of these models, the proposed algorithm handles 
each within the same unified framework. 

As practitioners in fields ranging from image-processing to astronomy turn to increasingly complex 
models to represent intricate real- world phenomena, the computational tools to approximate these models 
must grow accordingly. In this paper, we have proposed a general-purpose algorithm for automatic 
density exploration. Due to its fully adaptive nature, we foresee its application as a black-box exploratory 
MCMC method aimed at practitioners of Bayesian methods. While statisticians are well-accustomed to 
performing exploratory analysis in the modeling stage of an analysis, the notion of conducting preliminary 
general-purpose exploratory analysis in the Monte Carlo stage (or more generally, the model-fitting stage) 
of an analysis is an area which we feels deserves much further attention. As models grow in complexity, 
and endless model-specific Monte Carlo methods are proposed, it is valuable for the practitioner to have 
a universally applicable tool to throw at their problem before embarking on custom-tuned, hand-built 
Monte Carlo methods. Towards this aim, the authors have published an R package ("PAWL") to minimize 
user effort in applying the proposed algorithm to their specific problem. 
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A Details of Proposed Algorithm 

In Algorithm 3 we detail PAWL, fusing together a Wang-Landau base with adaptive binning, interacting 
parallel chains, and an adaptive proposal mechanism. In comparison to the generalized Wang-Landau 
algorithm (Algorithm 2), when a fiat histogram is reached the distribution of particles within bins is 
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tested to determine whether a given bin should be split. In addition, a suite of N interacting chains is 
employed, and hence the former chain X t is now made of TV chains: X t = {X\ f , . . . , X^), each defined 
on the state space X. All the N chains are used to update the bias 9 t , as described in Section 3.3. 

The chains are moved using an adaptive mechanism determined by the Metropolis-Hastings accep- 
tance rate as explained in Section 3.4. While we present Algorithm 3 with adaptive proposal variance, 
it may also be implemented with an adaptive mixture proposal as described in Section 3.4. Note that 
when a bin is split, it is possible to set the desired frequency of the new bins to some reduced value, say 
each obtaining half the desired frequency of the original - in fact in the numerical experiments we do 
exactly that. However, for notational and pedagogical simplicity, we present here the algorithm where 
the desired frequency of each bin is equal to l/d t at iteration t. 

Algorithm 3 Proposed Density Exploration Algorithm 
1: Run a preliminary exploration of the target e.g. using adaptive MCMC, and determine an energy 
range. 

2: Partition the state space into d regions {<%i j0 , ■ ■ ■ i ^d ,o} along a reaction coordinate £,(x), 

the default choice being £(x) = — log7r(a;). 
3: Mi G {1, . . . , d } set 6(i) <r- 1, v(i) 0. 
4: Choose an initial proposal standard deviation ctq 
5: Choose the frequency r with which to check for a flat histogram. 

6: Choose a decreasing sequence {pt\, typically pt — 1/t, to update the proposal standard deviation. 
7: Choose a decreasing sequence {7/c}, typically 7^ = 1/fc, to update the bias. 
8: Sample Xq ~ ttq, an initial distribution. 
9: for t = 1 to T do 

10: Sample X t from P$ t _ 1 (X t _x, •), a transition kernel with invariant distribution Jl^Li ^8±-i ( x )> 

parametrized by the proposal standard deviation Ot-\. 
11: Update the proposal standard deviation: a t Ct-i + pt (21(A > .234) — 1), 

where A is the last acceptance rate. 
12: Set d t <- d t -\. 

13: Update the proportions: Mi e {1,. .. ,d t } u(i) <- \ (t - l)u(i) + N' 1 Y^=i T x i>t {X[ j) ) . 

14: Every r-th iteration, check the distribution of samples within each bin, extending the range if 

necessary. For example, if if £(2;) = — log7r(a;) and a new minimum value of £(x) was found, 

extend the first bin in order to include this value. 
15: for i 6 {1, . . . , dt} do 
16: if bin i should be split then 

17: Create two sub-bins covering bin i, assign to each a weight equal to 8 t (i)/2. 

18: Set d t <— d t + 1, extend v. 

19: end if 
20: end for 

21: if "flat histogram": max^^ \v(i) — cZ t 1 1 < c/d then 

22: Set k *- k + 1. 

23: Reset Vi € {l,...,dt} v(i) <- 0. 

24: end if 

TTr,^r,f„ kino. ^ ^„ ( ^\ J . t i \ I I AT~l T. . fvU)\ 
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Update the bias: log0 t (i) <- Iogd t _i(i) +7fc(^" 1 Ej=i^ M (^t ) - «*T 

Normalize the bias: 9 t (i) t (i)/J2t=i e S)- 
end for 



B Algorithm Convergence 



The convergence of the Wang-Landau algorithm using a deterministic stepsize, also called the Stochastic 
Approximation Monte Carlo (SAMC) algorithm, and stochastic approximation algorithms in general, 
has been well-studied in Atchade and Liu (2010); Andrieu and Moulines (2006); Andrieu et al. (2005); 
Liang and Wu (2011), see Chapter 7 of Liang et al. (2010) for a recent introduction. Since writing this 
manuscript, we have also learned of recent convergence and ergodicity results for a parallel implementa- 
tion of the SAMC algorithm (Liang and Wu, 2011). However, as noted in Liang and Wu (2011) these 
results fail to explain why the parallel version is more efficient than the single-chain algorithm in practice; 
instead it proves the consistency of the algorithm when the number of iterations goes to infinity, and 
the asymptotic normality of the bias (0t)t>o, for any fixed number of chains. We believe that precise 
statements on the impact of the number of chains upon the stabilization of the bias (0 t )t>o would require 
the analysis of the Feynman-Kac semigroup associated with the algorithm, similar to what is commonly 
used to study the impact of the number of particles in Sequential Monte Carlo methods (Del Moral, 
2004). 

Each of our proposed improvements adds a level of complexity to the proof of the algorithm's consis- 
tency. First and foremost, we are using the Flat Histogram criterion, and thus the usual assumptions on 
the stepsize of the stochastic approximation are not easily verified (e.g. assumptions of Theorem 2.3 in 
Andrieu et al. (2005) and conditions (A4) in Liang and Wu (2011)). Indeed, if no flat histogram criterion 
was met, then the stepsize (7/c)fc>o would stay constant. We rely on a result in Jacob and Ryder (2011) 
that proves that the criterion is met in finite time, for any precision threshold c; therefore the results of 
Andrieu et al. (2005) and thus the results of Liang and Wu (2011) apply even when one uses the flat 
histogram criterion. 

Finally, with our inclusion of an adaptive proposal as a Robbins-Monro style update, the algorithm 
still remains in the class of stochastic approximation algorithms. One could pragmatically stop the 
adaptation of the proposal distribution after some iteration and fall back to the study of a homogeneous 
Metropolis-Hastings algorithm. However, we believe that the algorithm could be studied in the same 
framework as Andrieu and Moulines (2006); Liang and Wu (2011), where now the stochastic approxima- 
tion would both control the bias (6t)t>o and the standard deviation of the proposal (at)t>o- 

C Trimodal Target Example 

We introduce a toy target distribution to aid in demonstrating some of the concepts discussed earlier, 
especially the bin splitting strategy. Consider the 2-dimensional trimodal target described in Liang et al. 
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Figure 9: Trimodal example: log density function of the target distribution (4). The modes are separated 
by areas where the log density is very low, making exploration difficult. 

(2007): 




a mixture of three bivariate Gaussian distributions. The corresponding log density is shown in Figure 9. 
This density, while low-dimensional and with only three modes, is known to be difficult to sample from 
with Metropolis-Hastings (e.g. Gilks et al. 1998). Firstly, with different correlation structures in each 
mode, an adaptive algorithm might conform to one mode, missing (or poorly sampling from) the other 
two. Secondly, there is a low-density region separating each mode; as such, low-variance proposals might 
be incapable of jumping between modes. Figure 10 displays the biased target (Equation (1), using the log 
density as reaction coordinate) and one of its marginals, emphasizing the effect of biasing in improving 
the ability for the algorithm to explore the density. Here the plot is created using computationally 
expensive fine-scale numerical integration. 

Initial exploration is performed by an adaptive MCMC algorithm, run with 2 parallel chains and 
500 iterations. The proposal of this first run targets a specific acceptance rate of 23.4%, as described 
in Section 3.4. The explored energy range is expanded and divided into do — 3 initial bins. In all 
examples, we use c = 0.5. The proposed algorithm is run with 2 parallel chains for 2500 iterations. 
Figure 11(a) shows the regions recovered by the chains; the chains have moved easily between the modes, 
even if the distribution of the starting point was not well spread over the target density support. In this 
case, to reflect the possible lack of information about the target support, we draw the starting points 
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(a) Biased target density 



(b) Marginal of biased and unbiased target density 



Figure 10: Trimodal example: (left) log density of the biased target distribution (Equation (1)). The 
former modes are now all located in a fairly flat region, allowing for straightforward exploration. (Right) 
marginal log density of one component of the (symmetric) trimodal target. The solid line shows the 
target probability density function and the dashed line shows an approximation of the marginal of the 
biased distribution of Equation (1). The biased marginal is flatter, hence easier to explore than the 
original target distribution. 

of all the chains from a Af(0, 0.1 x I 2 ) distribution, hence exclusively in one of the three modes. In this 
setting the free energy SMC method described in Chopin and Jacob (2010) fails to recover the target 
distribution accurately; specifically, the central mode is over-sampled due to many particles not reaching 
the outer modes. However, if the initial distribution ttq is well spread over the target support, the SMC 
algorithm recovers the modes. Figure 11(b) shows the points generated by 3000 iterations of the adaptive 
Metropolis-Hastings algorithm (already used in the initial exploration), also using 2 chains. We see that 
the exploration was less successful, with the bottom left mode hardly visited at all, although the same 
number of point-wise density evaluations were performed as for the proposed algorithm. 

Along the iterations, the bins have been split three times. Here the chosen strategy was to split a 
bin if less than 25% of its points were situated in half of the bin. Figure 12 illustrates the effects of 
the binning strategy. Figure 12(a) shows the trace plot of the estimators 9, and the iterations at which 
bins are split are shown with vertical lines. After each split, the dimension of 9 increases but the figure 
shows that the new estimators quickly stabilize. After the last split around iteration 450, the number of 
bins stays constant. Figure 12(b) shows the histogram of the log density evaluations of the chain points, 
with vertical full lines showing the initial bins, and vertical dashed lines showing the bins that have been 
added during the run. We see that the bin splits induce more uniformity within bins, and hence across 
the entire reaction coordinate range, aiding in movement and exploration. 
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(a) Scatter plot of the chains generated by the (b) Scatter plot of the chains generated by an 
proposed algorithm adaptive Metropolis-Hastings algorithm 

Figure 11: Trimodal example: results of the proposed algorithm: (a) Scatter plot of all samples, before 
normalization/importance sampling, (b) Scatter plot of samples generated by an adaptive Metropolis- 
Hastings algorithm using the same number of chains and iterations. 
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(a) Trace plot of log 6, the log penalties, with vertical lines indicating the bin splits. 
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(b) Histogram of the energy values computed during the algorithm. Vertical full lines show the initial bins and dashed lines 
show the cuts that have been added dynamically. 

Figure 12: Trimodal example: histograms of the log density values of all the chain points just before 
the iterations at which the splitting mechanism is triggered. The number of bins increases automatically 
along the iterations. 
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